By: Addison Farber, Alvaro Gonzalez, Brian Bombasi, Stephen Smart
Advisor: Jeremy Morris
Swire Coca-Cola is seeking to improve its ability to predict the success of new restaurants in its market. To achieve this goal, we will analyze census data, consumer reviews, customer attributes, and Swire's sales to these customers, and use the data to build a predictive model that estimates the popularity, longevity, and total sales volume of new customers from historical results. The project will deliver a predictive model; a report on the methodology and results; a software implementation of the model; training and support for stakeholders on how to use and interpret the results; an interactive dashboard to visualize and explore the model's output; and a live presentation to showcase the findings and explain the methodology and results.
The project will be completed by a group of four MSBA students as a capstone project and is expected to be finished by mid-April. Its goal is to give Swire Coca-Cola a data-driven approach to making informed decisions on pricing and funding, maximizing profitability, and minimizing risk. The interactive dashboard and live presentation will be essential tools for stakeholders to understand and make use of the predictive model's results.
We will create a new column that classifies whether a customer was active at the time the data was pulled. A customer is considered active if it had at least one posting in the last three months.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import pandas_profiling as pp  # note: renamed to ydata_profiling in later releases
import warnings
warnings.filterwarnings('ignore')
from sqlalchemy import create_engine
##read FSOP_Customer_Data_v2.0.csv (csv export of the original .xlsb)
df = pd.read_csv(r'C:\Users\Alvaro\Documents\MSBA\FSOP_Customer_Data_v2.0.csv',parse_dates=['ON_BOARDING_DATE'])
##read FSOP_Sales_Data_v2.0.csv (csv export of the original .xlsb)
df2 = pd.read_csv(r'C:\Users\Alvaro\Documents\MSBA\FSOP_Sales_Data_v2.0.csv',parse_dates=['MIN_POSTING_DATE','MAX_POSTING_DATE'])
#connect to postgres
engine = create_engine('postgresql://capstone:fFmDDtgjBotGbt5roo8eKfgQt@capstonedb.postgres.database.azure.com:5432/postgres')
##create tables in postgres (one-time load; left commented out)
#df.to_sql('customer',engine,if_exists='replace')
#df2.to_sql('sales',engine,if_exists='replace')
#what is in the database
engine.table_names()
# read customer and sales data from postgres
df = pd.read_sql_table('customer',engine)
df2 = pd.read_sql_table('sales',engine)
['product', 'customer', 'sales', 'customer_features']
pp.ProfileReport(df)
pp.ProfileReport(df2)
We will create df4, which compiles the most valuable features per customer.
#create a new column with the max posting date per CUSTOMER_NUMBER_BLINDED
df2['MAX_DATE'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['MAX_POSTING_DATE'].transform('max')
#create a new column with the min posting date per CUSTOMER_NUMBER_BLINDED
df2['MIN_DATE'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['MIN_POSTING_DATE'].transform('min')
#difference between max and min date, in days
df2['DIFF_DATE'] = (df2['MAX_DATE'] - df2['MIN_DATE']).dt.days
#treat same-day activity as 1 day to avoid division by zero
df2['DIFF_DATE'] = np.where(df2['DIFF_DATE'] == 0, 1, df2['DIFF_DATE'])
#convert the difference to years and months, rounded to 4 decimals
df2['years_diff'] = round(df2['DIFF_DATE']/360,4)
df2['MONTHS_ACTIVE'] = round(df2['DIFF_DATE']/30,4)
# flag the customer as Active if their last posting is after 2022-09-30 (3 months before the data pull)
df2['ACTIVE'] = np.where(df2['MAX_DATE'] > '2022-09-30', 'Active', 'Inactive')
# total invoice value per CUSTOMER_NUMBER_BLINDED
df2['TOTAL_INVOICES'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['INVOICE_PRICE'].transform('sum')
# TMRR (total monthly recurring revenue): total invoices divided by months active
df2['TMRR'] = df2['TOTAL_INVOICES']/df2['MONTHS_ACTIVE']
# count of unique products per CUSTOMER_NUMBER_BLINDED
df2['UNIQUE_PRODUCTS'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['PRODUCT_SOLD_BLINDED'].transform('nunique')
# expected transactions per year: total transactions divided by years active
df2['EXPECTED_TRANS_YEAR'] = df2.groupby(['CUSTOMER_NUMBER_BLINDED'])['NUM_OF_TRANSACTIONS'].transform('sum')/df2['years_diff']
# total GROSS_PROFIT_DEAD_NET per CUSTOMER_NUMBER_BLINDED
df2['TOTAL_PROFIT'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['GROSS_PROFIT_DEAD_NET'].transform('sum')
# total physical volume per CUSTOMER_NUMBER_BLINDED
df2['TOTAL_VOLUME'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['PHYSICAL_VOLUME'].transform('sum')
#create a new DF with one row per customer: TMRR, ACTIVE, and the engineered features; drop duplicates and NA values
df3 = df2[['CUSTOMER_NUMBER_BLINDED','TMRR','ACTIVE','UNIQUE_PRODUCTS','EXPECTED_TRANS_YEAR','MAX_DATE','TOTAL_PROFIT','TOTAL_VOLUME','TOTAL_INVOICES','years_diff','MONTHS_ACTIVE']].drop_duplicates().dropna()
#keep customers with TMRR > 0
df3 = df3[(df3['TMRR'] > 0)]
#join df3 with df
df4 = df3.merge(df, on='CUSTOMER_NUMBER_BLINDED', how='left')
#split SALES_OFFICE_DESCRIPTION in sales city and sales state
df4['SALES_CITY'] = df4['SALES_OFFICE_DESCRIPTION'].str.split(',').str[0]
df4['SALES_STATE'] = df4['SALES_OFFICE_DESCRIPTION'].str.split(',').str[1]
#split DELIVERY_PLANT_DESCRIPTION in delivery city and delivery state
df4['DELIVERY_CITY'] = df4['DELIVERY_PLANT_DESCRIPTION'].str.split(',').str[0]
df4['DELIVERY_STATE'] = df4['DELIVERY_PLANT_DESCRIPTION'].str.split(',').str[1]
#split the zipcode in zipcode and zipcode plus 4
df4['ZIPCODE'] = df4['ADDRESS_ZIP_CODE'].str.split('-').str[0]
#plot number of customers active and inactive in df3
fig = px.histogram(df3, x="ACTIVE", color="ACTIVE", title='Number of Customers Active and Inactive')
fig.show()
If we consider customers inactive when they have not posted in the last three months, we get an imbalanced data set in which the large majority of customers are active, and it is important to consider how this may affect the accuracy of our future predictive model. One option is to rebalance the training data by oversampling the minority class (inactive customers) or undersampling the majority class (active customers). Another is to rely on evaluation metrics such as precision and recall, which account for false positives and false negatives in the predictions. It may also be worth trying algorithms and options that cope better with imbalance, such as tree ensembles with class weighting. Addressing the imbalance is an important step toward a model that can accurately identify and target customers by status.
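As a minimal sketch of the oversampling idea, here is random oversampling with replacement on a toy frame (the `toy` data and the 90/10 split are illustrative, not taken from the Swire data):

```python
import pandas as pd

# Toy stand-in for df3: 90 active customers, 10 inactive.
toy = pd.DataFrame({
    'ACTIVE': ['Active'] * 90 + ['Inactive'] * 10,
    'TMRR': range(100),
})

majority = toy[toy['ACTIVE'] == 'Active']
minority = toy[toy['ACTIVE'] == 'Inactive']

# Resample the minority class with replacement until the classes balance.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced['ACTIVE'].value_counts())  # 90 Active, 90 Inactive
```

Only the training split should be rebalanced this way; the test split must keep the original class proportions so the evaluation metrics stay honest.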
#distribution of TMRR by active and inactive customers, with a box-plot marginal
fig = px.histogram(df3, x="TMRR", color="ACTIVE", title='TMRR by Active and Inactive Customers', histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()
#histogram of unique products by active and inactive customers
fig = px.histogram(df3, x="UNIQUE_PRODUCTS", color="ACTIVE", title='Unique Products by Active and Inactive Customers',histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()
#histogram of expected transactions by active and inactive customers
fig = px.histogram(df3, x='EXPECTED_TRANS_YEAR', color="ACTIVE", title='Expected transactions per year by Active and Inactive Customers',histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()
Here we will explore where the customers are located, what type of business they run, and which sales office and delivery plant serve them.
#scale TMRR for use as a marker size (fraction of the max TMRR, times 10000)
df4['TMRR_MAX'] = df4['TMRR']/df4['TMRR'].max()*10000
#map of customers by location and color by active and inactive
fig = px.scatter_geo(df4, lat="GEO_LATITUDE", lon="GEO_LONGITUDE", color="ACTIVE", size="TMRR_MAX",
                     hover_name="CUSTOMER_NUMBER_BLINDED", hover_data=["TMRR", "UNIQUE_PRODUCTS"],
                     title='Map of Customers by Location and Color by Active and Inactive')
fig.update_layout(
    showlegend=True,
    geo=dict(
        scope='usa',
        landcolor='rgb(217, 217, 217)',
    )
)
fig.show()
In this map we can see that most of the inactive customers with a large MRR are in the Arizona zone, while the highest-MRR active customers are in Colorado, followed by Seattle and Oregon.
#count of customers by ADDRESS_CITY and ACTIVE, top 100 city/status combinations
fig = px.bar(df4.groupby(['ADDRESS_CITY','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False).head(100)
, x="ADDRESS_CITY", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by ADDRESS_City and ACTIVE')
fig.show()
#count of customers by sales_State and active
fig = px.bar(df4.groupby(['SALES_STATE','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False)
, x="SALES_STATE", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by SALES_STATE')
fig.show()
#count of customers by delivery_State and ACTIVE
fig = px.bar(df4.groupby(['DELIVERY_STATE','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False)
, x="DELIVERY_STATE", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by DELIVERY_STATE and ACTIVE')
fig.show()
We can see a relationship among the sales office, delivery plant, and customer location columns, so when building the model we can make a better selection of features.
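One way to quantify that kind of association between two categorical columns before selecting features is Cramér's V. A minimal sketch on toy data (the `cramers_v` helper and the two toy series are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy columns standing in for SALES_STATE / DELIVERY_STATE.
sales = pd.Series(['UT', 'UT', 'CO', 'CO', 'WA', 'WA', 'UT', 'CO'])
delivery = pd.Series(['UT', 'UT', 'CO', 'CO', 'WA', 'WA', 'UT', 'CO'])
print(cramers_v(sales, delivery))  # identical columns -> 1.0
```

Columns with V close to 1 carry nearly the same information, so keeping only one of them avoids redundant features in the model.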
What is the business line of the customers?
#plot customer trade channel by active and inactive sorted by count
fig = px.histogram(df4, x="CUSTOMER_TRADE_CHANNEL_DESCRIPTION", color="ACTIVE", title='Customer Trade Channel by Active and Inactive Customers',\
histnorm='probability density', hover_data=df4.columns, barmode='group',\
category_orders={"CUSTOMER_TRADE_CHANNEL_DESCRIPTION": df4['CUSTOMER_TRADE_CHANNEL_DESCRIPTION'].value_counts().index})
fig.show()
#plot cold drink channel by active and inactive customers
fig = px.histogram(df4, x="COLD_DRINK_CHANNEL_DESCRIPTION", color="ACTIVE", title='Cold Drink Channel by Active and Inactive Customers',\
histnorm='probability density', hover_data=df4.columns, barmode='group',\
category_orders={"COLD_DRINK_CHANNEL_DESCRIPTION": df4['COLD_DRINK_CHANNEL_DESCRIPTION'].value_counts().index})
fig.show()
This may indicate that customers in the Full Service and Amusement & Recreation channels tend to become inactive more often.
# plot median TMRR by cold drink channel
fig = px.bar(df4.groupby(['COLD_DRINK_CHANNEL_DESCRIPTION'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
, x="COLD_DRINK_CHANNEL_DESCRIPTION", y="TMRR", title='Median TMRR by Cold Drink Channel')
fig.show()
#plot median TMRR by customer trade channel description
fig = px.bar(df4.groupby(['CUSTOMER_TRADE_CHANNEL_DESCRIPTION'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
, x="CUSTOMER_TRADE_CHANNEL_DESCRIPTION", y="TMRR", title='Median TMRR by Customer Trade Channel Description')
fig.show()
#plot median TMRR by sales state
fig = px.bar(df4.groupby(['SALES_STATE'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
, x="SALES_STATE", y="TMRR", title='Median TMRR by Sales State')
fig.show()
Supermarkets, wholesale, and transportation are the types of customers that generate the most revenue.
In this part of the project, we will build a model to identify whether a customer merits a discount to sign a contract. When a customer is at risk of default, the business loses money because there is no guarantee of a return on the investment.
To achieve this, we will cluster products to analyze each customer's behavior within each product cluster, cluster locations to determine whether a customer is in a favorable or unfavorable place, and calculate each customer's lifetime value.
We will use the following models to predict the risk of a customer defaulting on a contract:
This is because we are trying to predict a binary outcome (low or high risk) and we have a relatively small dataset. A customer is high risk if they have a low profit lifetime value, or they have not ordered in the last 3 months and have less than 3 years of history.
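The risk label described above can be sketched as a boolean rule. The `LOW_PROFIT` cutoff, the `HIGH_RISK` column name, and the toy rows below are assumptions for illustration only:

```python
import pandas as pd

# Toy frame mirroring the engineered per-customer columns.
toy = pd.DataFrame({
    'profitclvt': [50000.0, 800.0, 1200.0],   # profit lifetime value
    'ACTIVE':     ['Active', 'Inactive', 'Active'],
    'years_diff': [5.0, 1.2, 0.8],            # years of history
})

LOW_PROFIT = 1000.0  # hypothetical cutoff for a "low" profit lifetime value

# High risk: low profit LTV, OR no order in 3 months (Inactive) with < 3 years of history.
toy['HIGH_RISK'] = (
    (toy['profitclvt'] < LOW_PROFIT)
    | ((toy['ACTIVE'] == 'Inactive') & (toy['years_diff'] < 3))
)
print(toy['HIGH_RISK'].tolist())  # [False, True, False]
```

Defining the label explicitly like this makes it easy to audit and to tune the cutoff later without touching the classifier.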
We have discovered which features matter and how they interact. Now we will create a DataFrame selecting the features that will be useful for predicting a customer's value and status, and whether they are risky to do business with.
We want to understand the types of products and how customers consume them. We will cluster the 1,558 products by their similarity on:
#new dataframe with the average NUM_OF_TRANSACTIONS per PRODUCT_SOLD_BLINDED
df5 = df2.groupby(['PRODUCT_SOLD_BLINDED'])['NUM_OF_TRANSACTIONS'].mean().reset_index().sort_values('NUM_OF_TRANSACTIONS', ascending=False)
# include the count of customers per PRODUCT_SOLD_BLINDED (aligned on the original index)
df5['#Customers'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index()['CUSTOMER_NUMBER_BLINDED']
# average GROSS_PROFIT_DEAD_NET per PRODUCT_SOLD_BLINDED
df5['AVG_GROSS_PROFIT_DEAD_NET'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['GROSS_PROFIT_DEAD_NET'].mean().reset_index()['GROSS_PROFIT_DEAD_NET']
# average physical volume per PRODUCT_SOLD_BLINDED
df5['AVG_VOLUME'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['PHYSICAL_VOLUME'].mean().reset_index()['PHYSICAL_VOLUME']
# average invoice price per PRODUCT_SOLD_BLINDED
df5['AVG_INVOICE'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['INVOICE_PRICE'].mean().reset_index()['INVOICE_PRICE']
# include 'BEV_CAT_DESC', 'CALORIE_CAT_DESC', 'PACK_TYPE_DESC', 'PACK_SIZE_SALES_UNIT_DESCRIPTION' in df5
df5 = df5.merge(df2[['PRODUCT_SOLD_BLINDED','BEV_CAT_DESC', 'CALORIE_CAT_DESC', 'PACK_TYPE_DESC', 'PACK_SIZE_SALES_UNIT_DESCRIPTION']], on='PRODUCT_SOLD_BLINDED', how='left')
df5 = df5.drop_duplicates()
df5.columns = df5.columns.astype(str)
df5.head()
| | PRODUCT_SOLD_BLINDED | NUM_OF_TRANSACTIONS | #Customers | AVG_GROSS_PROFIT_DEAD_NET | AVG_VOLUME | AVG_INVOICE | BEV_CAT_DESC | CALORIE_CAT_DESC | PACK_TYPE_DESC | PACK_SIZE_SALES_UNIT_DESCRIPTION |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M084802150628 | 128.200000 | 5 | 2388.546000 | 243.400000 | 15020.214000 | NaN | NaN | Paper Cup | 16 OZ 1-Ls 1000 |
| 5 | M014603040787 | 108.816667 | 60 | 1115.548833 | 323.895833 | 3633.831000 | CORE SPARKLING | REGULAR CALORIE | Plastic Bottle - Contour | 2 LTR 1-Ls 8 |
| 65 | M003403310963 | 104.666667 | 3 | -610.623333 | 95.000000 | 4550.500000 | NaN | NaN | Paper Cup | 32 OZ 32OZ CUP480 PAPER |
| 68 | M093708470229 | 100.966102 | 59 | 685.979661 | 203.237288 | 2262.545085 | CORE SPARKLING | REGULAR CALORIE | Plastic Bottle - Dimple | 2 LTR 1-Ls 8 |
| 127 | M088201310973 | 88.500000 | 2 | 1061.055000 | 167.000000 | 9811.250000 | NaN | NaN | Paper Cup | 16 OZ 1-Ls |
Data cleaning: remove duplicates and replace missing values with zeros, then apply a log transformation to the columns with high skewness.
Feature engineering: frequency-encode the categorical features ("BEV_CAT_DESC", "CALORIE_CAT_DESC", "PACK_TYPE_DESC", and "PACK_SIZE_SALES_UNIT_DESCRIPTION") to convert them into numerical values that the clustering algorithm can use.
Data scaling: StandardScaler from scikit-learn puts all features on the same scale. This step is essential for clustering, since it prevents any one feature from dominating the distance calculation.
Clustering: KMeans groups similar products into clusters based on their characteristics. We try different numbers of clusters and compute the Within-Cluster Sum of Squares (WCSS) for each.
The output is the list of WCSS values for different numbers of clusters. We use this to determine the optimal number of clusters via the elbow method, which identifies the point where the decrease in WCSS begins to slow down.
# Create clusters of products based on NUM_OF_TRANSACTIONS #Customers AVG_GROSS_PROFIT_DEAD_NET AVG_VOLUME AVG_INVOICE BEV_CAT_DESC CALORIE_CAT_DESC PACK_TYPE_DESC PACK_SIZE_SALES_UNIT_DESCRIPTION
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
#drop duplicates (nulls are filled with 0 below rather than dropped)
#df6 = df5.dropna()
df6 = df5.drop_duplicates()
# log-transform the skewed numeric columns; log(0) gives -inf and log of a
# negative value gives NaN, so replace any +/-inf with NaN and fill later
for col in ['AVG_VOLUME', 'AVG_INVOICE', 'AVG_GROSS_PROFIT_DEAD_NET',
            'NUM_OF_TRANSACTIONS', '#Customers']:
    df6[col] = np.log(df6[col]).replace([np.inf, -np.inf], np.nan)
#drop the product identifier before clustering
df6 = df6.drop('PRODUCT_SOLD_BLINDED', axis=1)
#fill remaining NaN values with 0
df6 = df6.fillna(0)
#frequency-encode a categorical column: map each category to its count (unseen or NaN -> 0)
def frequency_encoding(df, col):
    freq_enc_dict = df[col].value_counts().to_dict()
    df[col] = df[col].map(lambda x: freq_enc_dict.get(x, 0))
    return df
#frequency encode categorical variables
df6 = frequency_encoding(df6, 'BEV_CAT_DESC')
df6 = frequency_encoding(df6, 'CALORIE_CAT_DESC')
df6 = frequency_encoding(df6, 'PACK_TYPE_DESC')
df6 = frequency_encoding(df6, 'PACK_SIZE_SALES_UNIT_DESCRIPTION')
#scale data
scaler = StandardScaler()
df6 = scaler.fit_transform(df6)
#find the optimal number of clusters: compute WCSS for 1-19 clusters
wcss = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df6)
    wcss.append(kmeans.inertia_)
# plot the elbow curve using plotly
fig = px.line(x=range(1, 20), y=wcss, title='The Elbow Method')
fig.show()
This chart indicates that 6 clusters is a good choice: it is the point where the decrease in explained variation slows down.
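A quick programmatic cross-check of the elbow reading is to look for the largest second difference of the WCSS curve, i.e. where the curve bends most sharply. The `wcss` values below are illustrative stand-ins, not the notebook's actual output:

```python
import numpy as np

# Hypothetical WCSS values for k = 1..10 (illustrative only).
wcss = [1000, 950, 890, 820, 700, 400, 340, 310, 290, 280]

# The elbow is where the drop in WCSS slows the most: the largest
# second difference of the curve.
second_diff = np.diff(wcss, n=2)
elbow_k = int(np.argmax(second_diff)) + 2  # +2: differenced twice, k starts at 1
print(elbow_k)  # 6
```

This only automates the visual heuristic; metrics like the silhouette score can be used when the elbow is ambiguous.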
# create the cluster model
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42)
kmeans.fit(df6)
#df5['Cluster'] = kmeans.labels_.tolist()
df5 = df5.assign(Cluster=kmeans.labels_)
#plot clusters
fig = px.scatter_3d(df5, x='NUM_OF_TRANSACTIONS', y='AVG_GROSS_PROFIT_DEAD_NET', z='AVG_INVOICE', color='Cluster',\
title='Clusters of Products', hover_data=df5.columns)
fig.show()
Using the on-boarding date from the customer demographics and the max posting date (created in the previous feature engineering step):
# create a new column with the years between ON_BOARDING_DATE and MAX_DATE in df4
df4['longetivity'] = (df4['MAX_DATE'] - df4['ON_BOARDING_DATE']).dt.days/365
#histogram of longetivity showing the median
fig = px.histogram(df4, x='longetivity', nbins=100, title='Distribution of Longetivity')
fig.add_vline(x=df4['longetivity'].median(), line_width=3, line_dash="dash", line_color="green")
fig.show()
# describe clusters: mean of each numeric feature
df7 = df5.groupby('Cluster')[['NUM_OF_TRANSACTIONS', '#Customers', 'AVG_GROSS_PROFIT_DEAD_NET', 'AVG_VOLUME', 'AVG_INVOICE']].mean()
# number of products in each cluster
df7['#Products'] = df5.groupby('Cluster')['PRODUCT_SOLD_BLINDED'].count().reset_index()['PRODUCT_SOLD_BLINDED']
# most frequent product attributes in each cluster
df7['BEV_CAT_DESC'] = df5.groupby('Cluster')['BEV_CAT_DESC'].agg(lambda x: x.value_counts().index[0])
df7['CALORIE_CAT_DESC'] = df5.groupby('Cluster')['CALORIE_CAT_DESC'].agg(lambda x: x.value_counts().index[0])
df7['PACK_TYPE_DESC'] = df5.groupby('Cluster')['PACK_TYPE_DESC'].agg(lambda x: x.value_counts().index[0])
df7['PACK_SIZE_SALES_UNIT_DESCRIPTION'] = df5.groupby('Cluster')['PACK_SIZE_SALES_UNIT_DESCRIPTION'].agg(lambda x: x.value_counts().index[0])
# least frequent product attributes in each cluster
df7['BEV_CAT_DESC2'] = df5.groupby('Cluster')['BEV_CAT_DESC'].agg(lambda x: x.value_counts().index[-1])
df7['CALORIE_CAT_DESC2'] = df5.groupby('Cluster')['CALORIE_CAT_DESC'].agg(lambda x: x.value_counts().index[-1])
df7['PACK_TYPE_DESC2'] = df5.groupby('Cluster')['PACK_TYPE_DESC'].agg(lambda x: x.value_counts().index[-1])
df7['PACK_SIZE_SALES_UNIT_DESCRIPTION2'] = df5.groupby('Cluster')['PACK_SIZE_SALES_UNIT_DESCRIPTION'].agg(lambda x: x.value_counts().index[-1])
df7
| Cluster | NUM_OF_TRANSACTIONS | #Customers | AVG_GROSS_PROFIT_DEAD_NET | AVG_VOLUME | AVG_INVOICE | #Products | BEV_CAT_DESC | CALORIE_CAT_DESC | PACK_TYPE_DESC | PACK_SIZE_SALES_UNIT_DESCRIPTION | BEV_CAT_DESC2 | CALORIE_CAT_DESC2 | PACK_TYPE_DESC2 | PACK_SIZE_SALES_UNIT_DESCRIPTION2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.739553 | 1081.587786 | 179.175388 | 28.252418 | 534.159858 | 262 | CORE SPARKLING | REGULAR CALORIE | Aluminum Can | 16 OZ 1-Ls 24 | SPORTS DRINKS | LOW CALORIE | Plastic Bottle - Dimple | 20 OZ 1-Ls 12 |
| 1 | 3.759816 | 70.614706 | 27.228056 | 6.810729 | 139.037550 | 340 | ENERGY DRINKS | REGULAR CALORIE | Aluminum Can | 16 OZ 4-Pk 24 | JUICES/NECTARS | LOW CALORIE | Glass Bottle - Dimple | 1 LTR 6-Pk 12 |
| 2 | 12.267936 | 398.000000 | 1698.628410 | 136.598598 | 6906.256432 | 186 | PACKAGED WATER (PLAIN & ENRICHED) | LOW CALORIE | Paper Cup | 32 OZ 1-Ls | PACKAGED WATER (PLAIN & ENRICHED) | LOW CALORIE | Tank | 20 OZ 1-Ls 1000 |
| 3 | 19.042104 | 624.950980 | 1773.613183 | 670.512028 | 6869.723816 | 204 | CORE SPARKLING | REGULAR CALORIE | Aluminum Can - Sleek | 7.5 OZ 6-Pk 24 | PACKAGED WATER (PLAIN & ENRICHED) | LOW CALORIE | Tetra Pak | 8.45 OZ 1-Ls 24 |
| 4 | 16.799252 | 1002.384127 | 133.051617 | 35.821840 | 826.219204 | 315 | SPORTS DRINKS | REGULAR CALORIE | Plastic Bottle - Other | 2.5 GALLON 1-Ls | COFFEE | LOW CALORIE | Plastic Cartridge | 12 OZ 12-Pk 12 |
| 5 | 2.494128 | 28.701195 | -2.635169 | 0.108233 | -4.030685 | 251 | CORE SPARKLING | REGULAR CALORIE | Plastic Bottle - Other | 20 OZ 1-Ls 24 | DAIRY/SOY BEVERAGES | LOW CALORIE | Tetra Pak | 16 OZ 10-Pk 20 |
#send to postgres df5 as table 'product'
df5.to_sql('product', engine, if_exists='replace', index=False)
558
#join df2 with the product clusters from df5 and longetivity from df4
df11= df2[['CUSTOMER_NUMBER_BLINDED', 'PRODUCT_SOLD_BLINDED','MIN_POSTING_DATE', 'MAX_POSTING_DATE','PHYSICAL_VOLUME',
'INVOICE_PRICE','GROSS_PROFIT_DEAD_NET','NUM_OF_TRANSACTIONS']]\
.merge(df5[['PRODUCT_SOLD_BLINDED', 'Cluster']], on='PRODUCT_SOLD_BLINDED', how='left')\
.merge(df4[['CUSTOMER_NUMBER_BLINDED', 'longetivity']], on='CUSTOMER_NUMBER_BLINDED', how='left')
#create a new column with the max posting date per customer and cluster
df11['MAX_DATE'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['MAX_POSTING_DATE'].transform('max')
#create a new column with the min posting date per customer and cluster
df11['MIN_DATE'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['MIN_POSTING_DATE'].transform('min')
#difference between max and min date, in days
df11['DIFF_DATE'] = (df11['MAX_DATE'] - df11['MIN_DATE']).dt.days
#treat same-day activity as 1 day to avoid division by zero
df11['DIFF_DATE'] = np.where(df11['DIFF_DATE'] == 0, 1, df11['DIFF_DATE'])
#convert the difference to years, rounded to 4 decimals
df11['years_diff'] = round(df11['DIFF_DATE']/360,4)
#total invoice value per customer and cluster
df11['TOTAL_INVOICES'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['INVOICE_PRICE'].transform('sum')
#expected transactions per year per customer and cluster
df11['EXPECTED_TRANS_YEAR'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['NUM_OF_TRANSACTIONS'].transform('sum')/df11['years_diff']
#total GROSS_PROFIT_DEAD_NET per customer and cluster
df11['TOTAL_PROFIT'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['GROSS_PROFIT_DEAD_NET'].transform('sum')
#total physical volume per customer and cluster
df11['TOTAL_VOLUME'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['PHYSICAL_VOLUME'].transform('sum')
#revenue CLTV: annualized invoices times tenure (longetivity)
df11['revclvt'] = df11['TOTAL_INVOICES']/df11['years_diff']*df11['longetivity']
#volume CLTV
df11['volclvt'] = df11['TOTAL_VOLUME']/df11['years_diff']*df11['longetivity']
#profit CLTV
df11['profitclvt'] = df11['TOTAL_PROFIT']/df11['years_diff']*df11['longetivity']
#clusters as string
df11['Cluster'] = df11['Cluster'].astype(str)
#drop duplicates
df11 = df11.drop_duplicates()
df11=df11.drop(['MAX_DATE','MIN_DATE','DIFF_DATE','years_diff','PRODUCT_SOLD_BLINDED','MIN_POSTING_DATE', 'MAX_POSTING_DATE','PHYSICAL_VOLUME',
'INVOICE_PRICE','GROSS_PROFIT_DEAD_NET','NUM_OF_TRANSACTIONS', 'TOTAL_PROFIT', 'TOTAL_VOLUME','longetivity', 'TOTAL_INVOICES'], axis=1)
df11.columns
#pivot: one row per CUSTOMER_NUMBER_BLINDED, one column per cluster for each CLTV/transaction value
df12 = df11.pivot_table(index=['CUSTOMER_NUMBER_BLINDED'], columns='Cluster', values=['EXPECTED_TRANS_YEAR', 'revclvt','volclvt', 'profitclvt'] ).reset_index()
#multiindex to single index
df12.columns = ['_'.join(col).strip() for col in df12.columns.values]
#rename customer_number_blinded_ as customer_number_blinded
df12.rename(columns={'CUSTOMER_NUMBER_BLINDED_': 'CUSTOMER_NUMBER_BLINDED'}, inplace=True)
df12 = df12.fillna(0)
df12.head()
| | CUSTOMER_NUMBER_BLINDED | EXPECTED_TRANS_YEAR_0 | EXPECTED_TRANS_YEAR_1 | EXPECTED_TRANS_YEAR_2 | EXPECTED_TRANS_YEAR_3 | EXPECTED_TRANS_YEAR_4 | EXPECTED_TRANS_YEAR_5 | profitclvt_0 | profitclvt_1 | profitclvt_2 | ... | revclvt_2 | revclvt_3 | revclvt_4 | revclvt_5 | volclvt_0 | volclvt_1 | volclvt_2 | volclvt_3 | volclvt_4 | volclvt_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C0001004007230617 | 51.413882 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 1 | C0001004306830869 | 1428.571429 | 0.0 | 0.000000 | 357.142857 | 357.142857 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 2 | C0001005009010198 | 0.000000 | 0.0 | 0.000000 | 30.393084 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 3975.667177 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 306.961820 | 0.000000 | 0.0 |
| 3 | C0001006902500643 | 274.142658 | 0.0 | 714.285714 | 0.000000 | 76.553044 | 0.0 | 11400.605126 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 6717.488658 | 0.0 | 730.978120 | 0.0 | -82966.731898 | 0.000000 | 305.408667 | 0.0 |
| 4 | C0001007600510372 | 357.142857 | 0.0 | 7.706281 | 16.090105 | 12.109870 | 0.0 | 20956.262231 | 0.0 | 0.0 | ... | 0.0 | 3851.972526 | 1753.093001 | 0.0 | 933.463796 | 0.0 | -96.393355 | 39.426267 | 34.528963 | 0.0 |
5 rows × 25 columns
We created clusters of locations to identify the hot zones (commercial areas), which will help determine whether location is a valuable feature for success.
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import LabelEncoder
import numpy as np
#drop null values
df9 = df4.dropna()
#keep one row per customer
df8 = df9.drop_duplicates(['CUSTOMER_NUMBER_BLINDED'])
#keep only the location features
df8 = df8.drop('CUSTOMER_NUMBER_BLINDED', axis=1)[['GEO_LONGITUDE', 'GEO_LATITUDE',"ZIPCODE"]]
#ZIPCODE was split from a string, so cast it to numeric for the distance computation
df8['ZIPCODE'] = pd.to_numeric(df8['ZIPCODE'], errors='coerce').fillna(0)
#df8['ADDRESS_CITY'] = LabelEncoder().fit_transform(df8['ADDRESS_CITY'])
# convert the location data to a numpy array
X = np.array(df8)
# define a range of hyperparameters to search over
param_grid = {'eps': [0.05, 0.1, 0.5, 1], 'min_samples': [2, 3]}
# create a list to store the clustering score for each combination of hyperparameters
scores = []
# loop over each combination of hyperparameters
for params in ParameterGrid(param_grid):
    # fit the DBSCAN model with the current hyperparameters
    dbscan = DBSCAN(**params).fit(X)
    labels = dbscan.labels_
    # score the clustering with the Calinski-Harabasz index
    #score = silhouette_score(X, labels)
    score = calinski_harabasz_score(X, labels)
    # store the score for the current hyperparameters
    scores.append({'params': params, 'score': score})
# find the hyperparameters with the highest score
best_params = max(scores, key=lambda x: x['score'])['params']
# refit DBSCAN with the best hyperparameters and use its labels (not the last grid model's)
best_dbscan = DBSCAN(**best_params).fit(X)
labels = best_dbscan.labels_
# df8 kept df9's index, so assign the labels by index, then fill the rows that
# were dropped as duplicate customers with each customer's first label
df9['location_Cluster'] = pd.Series(labels, index=df8.index)
df9['location_Cluster'] = df9.groupby('CUSTOMER_NUMBER_BLINDED')['location_Cluster'].transform('first')
df9['location_Cluster'] = df9['location_Cluster'].astype(int).astype(str)
#plot clusters in a us map
import plotly.express as px
fig = px.scatter_mapbox(df9, lat="GEO_LATITUDE", lon="GEO_LONGITUDE", color="location_Cluster", hover_name="COUNTY",\
hover_data=["ADDRESS_CITY", "SALES_STATE", "ZIPCODE"], zoom=3, height=600)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
#drop high correlation columns
df10=df9.drop(['GEO_LATITUDE', 'GEO_LONGITUDE','ADDRESS_ZIP_CODE','DELIVERY_CITY', 'DELIVERY_STATE', 'ZIPCODE','TMRR','SALES_OFFICE_DESCRIPTION',
'DELIVERY_PLANT_DESCRIPTION','SALES_CITY', 'SALES_STATE','CUSTOMER_TRADE_CHANNEL_DESCRIPTION',
'CUSTOMER_SUB_TRADE_CHANNEL_DESCRIPTION','CUSTOMER_TRADE_CHANNEL_DESCRIPTION2'], axis=1)
Customer Lifetime Value (CLV or CLTV) is the average revenue you can generate from a customer over the entire lifetime of their account; in simple terms, it is the money you would make from a customer before they churn. A common formula is: (average monetary value per transaction) x (expected number of transactions per year) x (expected tenure in years), where the monetary value can be revenue, profit, or volume per transaction.
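As a quick worked example of the formula with made-up numbers (all three inputs below are hypothetical, not from the Swire data):

```python
# Worked example of CLV = (avg monetary value per transaction)
#                         x (expected transactions per year)
#                         x (expected tenure in years)
avg_revenue_per_txn = 25.0   # hypothetical dollars per transaction
txns_per_year = 48           # hypothetical purchase frequency
tenure_years = 3             # the 3-year horizon used for the target below

clv = avg_revenue_per_txn * txns_per_year * tenure_years
print(clv)  # 3600.0
```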
For this we will create three features measuring customer lifetime value. Our target, however, is to predict a customer's CLTV at 3 years, so that we have a comparable 3-year value for every customer: we fix the tenure at 3 years when computing the target.
# revenue-based CLTV: revenue per observed year times customer longevity
df10['revclvt'] = df10['TOTAL_INVOICES']/df10['years_diff']*df10['longetivity']
# volume-based CLTV
df10['volclvt'] = df10['TOTAL_VOLUME']/df10['years_diff']*df10['longetivity']
# profit-based CLTV
df10['profitclvt'] = df10['TOTAL_PROFIT']/df10['years_diff']*df10['longetivity']
# profit-based CLTV over a fixed 3-year horizon
df10['profitclvt3'] = df10['TOTAL_PROFIT']/df10['years_diff']*3
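As a sanity check, the CLTV arithmetic above can be traced on hypothetical numbers (the column names mirror df10, but the values are made up):

```python
import pandas as pd

# Toy illustration: CLTV here is (total metric / observed years) * tenure in years.
toy = pd.DataFrame({
    'TOTAL_INVOICES': [1200.0],   # total revenue observed for the customer
    'TOTAL_PROFIT': [300.0],
    'years_diff': [2.0],          # years the customer has been observed
})
horizon = 3  # fixed 3-year horizon, as in profitclvt3 above
toy['revclvt3'] = toy['TOTAL_INVOICES'] / toy['years_diff'] * horizon
toy['profitclvt3'] = toy['TOTAL_PROFIT'] / toy['years_diff'] * horizon
print(toy[['revclvt3', 'profitclvt3']])
# revclvt3 = 1200/2*3 = 1800.0; profitclvt3 = 300/2*3 = 450.0
```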
#join df10 and df12
df13 = pd.merge(df10, df12, on='CUSTOMER_NUMBER_BLINDED', how='left')
#drop duplicates
df13 = df13.drop_duplicates()
df13=df13.drop([ 'years_diff','MAX_DATE', 'MONTHS_ACTIVE', 'ON_BOARDING_DATE','ADDRESS_CITY', 'COUNTY','MARKET_DESCRIPTION','BUSINESS_TYPE_EXTENSION_DESCRIPTION'], axis=1)
df13.head()
| CUSTOMER_NUMBER_BLINDED | ACTIVE | UNIQUE_PRODUCTS | EXPECTED_TRANS_YEAR | TOTAL_PROFIT | TOTAL_VOLUME | TOTAL_INVOICES | CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION | COLD_DRINK_CHANNEL_DESCRIPTION | TMRR_MAX | ... | revclvt_2 | revclvt_3 | revclvt_4 | revclvt_5 | volclvt_0 | volclvt_1 | volclvt_2 | volclvt_3 | volclvt_4 | volclvt_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C0051046109640797 | Active | 55 | 1211.277410 | 14833.42 | 1543.0 | 43793.98 | Other Shopping & Ser | RETAIL | 61.152265 | ... | 0.0 | 4.399003e+02 | 1.947757e+05 | 0.0 | 6087.788118 | 0.000000 | -309.152256 | 10.594902 | 8494.805882 | 0.0 |
| 1 | C0348074302380406 | Active | 293 | 9863.258913 | 196626.22 | 65981.0 | 797839.25 | Grocery Shopping | WHOLESALE | 1114.072689 | ... | 0.0 | 1.450065e+06 | 1.146299e+06 | 0.0 | 28136.950882 | 1291.579023 | -3011.338920 | 113754.912292 | 109013.238172 | 0.0 |
| 2 | C0277089703710223 | Active | 67 | 2137.848469 | 20219.51 | 5908.0 | 101043.78 | Third Party (Non-Con | WHOLESALE | 242.641240 | ... | 0.0 | 1.992113e+04 | 5.460199e+04 | 0.0 | 1129.758616 | 41.880744 | -20.673599 | 2125.450872 | 2859.126914 | 0.0 |
| 3 | C0112072503650635 | Active | 117 | 3301.570496 | 4698.02 | 715.0 | 16176.86 | Other Shopping & Ser | RETAIL | 80.167835 | ... | 0.0 | 1.608157e+02 | 7.755052e+03 | 0.0 | 281.224392 | 14.213507 | -24.901700 | 6.091503 | 446.710226 | 0.0 |
| 4 | C0305009008200279 | Active | 16 | 875.640197 | 5440.52 | 406.0 | 17579.21 | Eating & Drinking | EATING/DRINKING | 24.546947 | ... | 0.0 | 3.846574e+04 | 4.009032e+03 | 0.0 | 3025.666455 | 0.000000 | -5163.856583 | 3608.459352 | 330.249308 | 0.0 |
5 rows × 40 columns
df13.describe()
| UNIQUE_PRODUCTS | EXPECTED_TRANS_YEAR | TOTAL_PROFIT | TOTAL_VOLUME | TOTAL_INVOICES | TMRR_MAX | longetivity | revclvt | volclvt | profitclvt | ... | revclvt_2 | revclvt_3 | revclvt_4 | revclvt_5 | volclvt_0 | volclvt_1 | volclvt_2 | volclvt_3 | volclvt_4 | volclvt_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34997.000000 | 34997.000000 | 3.499700e+04 | 34997.000000 | 3.499700e+04 | 34997.000000 | 34997.000000 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | ... | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 3.499700e+04 | 34997.000000 |
| mean | 22.799326 | 360.513414 | 7.014707e+03 | 858.924912 | 2.202297e+04 | 50.412595 | 8.213564 | 1.540444e+05 | 5.791572e+03 | 4.865545e+04 | ... | 5.123854e+03 | 5.401059e+04 | 5.774288e+04 | 3.382194e+02 | 2.704254e+03 | 3.088674e+02 | -3.593143e+03 | 1.541770e+03 | 2.844851e+03 | 21.651934 |
| std | 25.451121 | 700.854277 | 3.564474e+04 | 7408.197039 | 1.309124e+05 | 236.281483 | 8.283451 | 1.181153e+06 | 5.961992e+04 | 3.371066e+05 | ... | 8.055127e+04 | 4.583457e+05 | 5.949367e+05 | 2.190911e+04 | 3.153476e+04 | 1.035769e+04 | 3.105018e+04 | 2.641386e+04 | 4.856405e+04 | 833.568875 |
| min | 1.000000 | 1.858851 | -7.950499e+04 | -8016.000000 | 1.000000e-02 | 0.000333 | 0.002740 | 3.868996e-02 | -5.790822e+05 | -3.016180e+06 | ... | -1.244932e+06 | -1.624556e+06 | -1.650220e+05 | -1.317070e+06 | -1.066145e+04 | -7.210959e+04 | -2.133591e+06 | -2.027397e+04 | -4.755382e+03 | -46624.266145 |
| 25% | 9.000000 | 59.644265 | 6.274000e+02 | 31.000000 | 1.652070e+03 | 5.452338 | 2.013699 | 5.578197e+03 | 1.061732e+02 | 2.114215e+03 | ... | 0.000000e+00 | 0.000000e+00 | 1.209513e+03 | 0.000000e+00 | 2.806732e+01 | 0.000000e+00 | -4.142466e+02 | 0.000000e+00 | 4.011569e+01 | 0.000000 |
| 50% | 15.000000 | 153.246753 | 1.945760e+03 | 115.000000 | 5.093940e+03 | 12.743426 | 5.375342 | 2.101694e+04 | 4.695053e+02 | 7.877857e+03 | ... | 0.000000e+00 | 3.564559e+03 | 5.162063e+03 | 0.000000e+00 | 1.650972e+02 | 0.000000e+00 | -3.818336e+01 | 6.393164e+01 | 1.833032e+02 | 0.000000 |
| 75% | 26.000000 | 368.324125 | 5.381650e+03 | 378.000000 | 1.462452e+04 | 30.703487 | 11.353425 | 6.969400e+04 | 1.826380e+03 | 2.582853e+04 | ... | 0.000000e+00 | 2.127043e+04 | 1.991277e+04 | 0.000000e+00 | 8.388149e+02 | 0.000000e+00 | 0.000000e+00 | 3.775722e+02 | 7.699032e+02 | 0.000000 |
| max | 465.000000 | 23308.408438 | 2.528506e+06 | 509144.000000 | 7.181258e+06 | 10000.000000 | 41.441096 | 9.220280e+07 | 5.299795e+06 | 2.108559e+07 | ... | 6.715323e+06 | 2.785597e+07 | 7.658674e+07 | 2.371820e+06 | 2.575256e+06 | 1.774618e+06 | 2.133855e+05 | 3.418162e+06 | 8.036384e+06 | 45737.769080 |
8 rows × 35 columns
#send to postgres df13 as customer_features
df13.to_sql('customer_features', engine, if_exists='replace', index=False)
#df13 to csv
df13.to_csv('customer_features.csv', index=False)
df13.columns
Index(['CUSTOMER_NUMBER_BLINDED', 'ACTIVE', 'UNIQUE_PRODUCTS',
'EXPECTED_TRANS_YEAR', 'TOTAL_PROFIT', 'TOTAL_VOLUME', 'TOTAL_INVOICES',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION',
'COLD_DRINK_CHANNEL_DESCRIPTION', 'TMRR_MAX', 'longetivity',
'location_Cluster', 'revclvt', 'volclvt', 'profitclvt', 'profitclvt3',
'EXPECTED_TRANS_YEAR_0', 'EXPECTED_TRANS_YEAR_1',
'EXPECTED_TRANS_YEAR_2', 'EXPECTED_TRANS_YEAR_3',
'EXPECTED_TRANS_YEAR_4', 'EXPECTED_TRANS_YEAR_5', 'profitclvt_0',
'profitclvt_1', 'profitclvt_2', 'profitclvt_3', 'profitclvt_4',
'profitclvt_5', 'revclvt_0', 'revclvt_1', 'revclvt_2', 'revclvt_3',
'revclvt_4', 'revclvt_5', 'volclvt_0', 'volclvt_1', 'volclvt_2',
'volclvt_3', 'volclvt_4', 'volclvt_5'],
dtype='object')
#df13 correlation matrix
corr = df13.corr()
#plot the heatmap in plotly
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap(
z=corr.values,
x=corr.columns,
y=corr.columns,
colorscale='Viridis'))
fig.show()
We will classify the customers into two groups: risky and non-risky. We will use the features created in the previous section to predict each customer's CLTV and the probability of being risky.
First, a diminishing returns curve of profit CLTV 3 is created by sorting the profit CLTV3 values in descending order and calculating the cumulative sum. Outliers are removed from this curve by calculating the first and third quartiles, interquartile range, and lower and upper bounds, and filtering out values outside this range.
Then, the KneeLocator package is used to find the elbow point, which represents the threshold for a good profit CLTV. The KneeLocator package takes in the indices and values of the cumulative sum curve, and uses a heuristic algorithm to find the point of maximum curvature, or the "knee" of the curve. The KneeLocator package also allows specifying the curve type and direction.
Finally, the elbow point is plotted on the cumulative sum curve using Plotly, and the corresponding value for the profit CLTV is printed.
This elbow point is used to create a binary target variable, where 1 represents a customer with a good profit CLTV and 0 represents a customer with a bad profit CLTV.
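The maximum-curvature idea behind KneeLocator can be sketched without the kneed package: on a concave, increasing curve, the knee is roughly the point farthest from the chord joining the endpoints. This is a simplified stand-in for illustration, not the exact algorithm kneed implements:

```python
import numpy as np

def knee_by_chord_distance(y):
    """Rough stand-in for KneeLocator: on a concave increasing curve,
    the knee is the point farthest from the straight line (chord)
    joining the first and last points."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    # perpendicular distance from each point to the line ax + by + c = 0
    # through the endpoints (x0, y0) and (x1, y1)
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    num = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    den = np.hypot(y1 - y0, x1 - x0)
    return int(np.argmax(num / den))

# concave increasing curve, like a cumulative sum of descending profits
profits = np.array([50.0, 30.0, 10.0, 5.0, 2.0, 1.0])
curve = np.cumsum(profits)           # [50, 80, 90, 95, 97, 98]
print(knee_by_chord_distance(curve))  # -> 2
```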
#diminishing returns curve of profitclvt
pc0 = df13.sort_values('profitclvt3', ascending=False)['profitclvt3'].reset_index()
#calculate outliers
q1 = pc0['profitclvt3'].quantile(0.25)
q3 = pc0['profitclvt3'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
pc0 = pc0[(pc0['profitclvt3'] > lower_bound) & (pc0['profitclvt3'] < upper_bound)]
#profitclvt cumsum
pcv = pc0['profitclvt3'].cumsum().reset_index()
#plot profitclvt in plotly
from kneed import KneeLocator
# Use the KneeLocator package to find the elbow point;
# the cumulative-sum curve rises and flattens, so it is concave and increasing
kl = KneeLocator(pcv.index, pcv['profitclvt3'], curve="concave", direction="increasing")
# Plot the elbow point
fig = px.line(pcv, x='index', y='profitclvt3', title='Elbow Method For Optimal k')
fig.add_vline(x=kl.elbow, line_width=3, line_dash="dash", line_color="green")
fig.show()
print(pc0[pc0.index == kl.elbow])
print(pc0.profitclvt3.median())
       index  profitclvt3
31248  10174  1071.889428
4369.683257918552
We will also mix inactivity into the risk label: a customer is high risk when their 3-year CLTV is below the elbow point and they are either inactive or have less than 3 years of longevity.
#create a risk category based on profitclvt3, longevity, and active status
df13['risk'] = np.where(((df13['profitclvt3'] > 1071.889428) | ((df13['longetivity'] > 3) & (df13['ACTIVE']== 'Active'))), 'low', 'high')
#median profitclvt3 for each risk category
df13.groupby('risk')['profitclvt3'].median()
# note: the overall median of profitclvt3 across both groups is 4369.683257918552
risk
high     636.307959
low     5669.904268
Name: profitclvt3, dtype: float64
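The labeling rule can be checked on a few hypothetical rows (the threshold and column names match the code above, including the project's 'longetivity' column spelling; the data itself is made up):

```python
import numpy as np
import pandas as pd

# 'low' risk if profit CLTV exceeds the elbow threshold, OR the customer
# is active with more than 3 years of history; otherwise 'high' risk.
threshold = 1071.889428
toy = pd.DataFrame({'profitclvt3': [2000.0, 500.0, 500.0],
                    'longetivity': [1.0, 5.0, 1.0],
                    'ACTIVE': ['Inactive', 'Active', 'Inactive']})
toy['risk'] = np.where((toy['profitclvt3'] > threshold)
                       | ((toy['longetivity'] > 3) & (toy['ACTIVE'] == 'Active')),
                       'low', 'high')
print(toy['risk'].tolist())   # -> ['low', 'low', 'high']
```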
import sklearn as sk
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
from sklearn.model_selection import train_test_split, cross_validate,\
GridSearchCV, cross_val_score, KFold, ParameterGrid
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support,\
accuracy_score, recall_score, precision_score, f1_score,\
confusion_matrix, classification_report,precision_recall_curve
# from sklearn.naive_bayes import GaussianNB
# from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier,\
BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import roc_curve, roc_auc_score,\
precision_recall_curve, average_precision_score,\
balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, ParameterGrid
df14 = df13.copy()  # copy so the in-place log transform below does not alter df13
df14.columns
df14=df14.drop(['ACTIVE','TMRR_MAX','profitclvt3','longetivity'],axis=1)
# Generate y counts and create a vector of y levels
imb_y_counts = df14['risk'].value_counts()
y_levels = imb_y_counts.index
y_counts = df14['risk'].value_counts()
print('In df_imb:','\n')
for i in y_levels:
print(f'{round(100*y_counts[i]/df14.shape[0],2)} percent is {i}','\n')
In df_imb:

94.03 percent is low

5.97 percent is high
We know that the data set is imbalanced, so we will use different methods to oversample the minority class and undersample the majority class.
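A minimal sketch of random oversampling with scikit-learn's resample, on made-up data rather than df14:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for df14 with its 'risk' target
toy = pd.DataFrame({'x': range(10),
                    'risk': ['low'] * 8 + ['high'] * 2})
majority = toy[toy['risk'] == 'low']
minority = toy[toy['risk'] == 'high']
# sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['risk'].value_counts())   # 8 low, 8 high
```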
# create a list of DecisionTreeClassifier estimator objects
DTC3 = DecisionTreeClassifier(max_depth=3,random_state=42)
DTC4 = DecisionTreeClassifier(max_depth=4,random_state=42)
DTC5 = DecisionTreeClassifier(max_depth=5,random_state=42)
DTC_ent4 = DecisionTreeClassifier(criterion='entropy',max_depth=4,random_state=42)
DTC_list = [DTC3,DTC4,DTC_ent4, DTC5]
DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']
# SVC methods take a long time. Try three different kernels.
# Need to include probability=True for generating areas under ROC and precision-recall curves
SVC_rbf = SVC(probability=True)
SVC_linear = SVC(kernel='linear',probability=True)
SVC_poly = SVC(kernel='poly',degree=2,probability=True)
SVC_list = [SVC_rbf,SVC_linear, SVC_poly]
SVC_name_list = ['SVC_rbf','SVC_linear','SVC_poly']
# Reduce to four LogisticRegression objects
lr_lbfgs = LogisticRegression(random_state=42)
lr_newton = LogisticRegression(solver='newton-cg',random_state=42)
lr_lib = LogisticRegression(solver='liblinear',random_state=42)
lr_lib_l1 = LogisticRegression(solver='liblinear',penalty = 'l1',random_state=42)
lr_list = [lr_lbfgs,lr_newton, lr_lib,lr_lib_l1,]
lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',]
# Reduce to four rf objects. Don't work well for CD_imb2 (extreme class-imbalance)
rf_max7 = RandomForestClassifier(max_depth=7,random_state=42)
rf_entropy_max7 = RandomForestClassifier(criterion='entropy',max_depth=7,random_state=42)
rf_list = [rf_max7, rf_entropy_max7]
rf_name_list = ['rf_max7','rf_entropy_max7']
- Split the data column-wise into the target variable, numeric variables, and categorical variables.
- Use pd.get_dummies() to generate dummy variables that encode the categorical variables.
- Create the encoded model input features by joining the dummy variables with the numeric features.
- Split the target variable series and the encoded input features row-wise into train, validation, and test sets.

Encode the categorical columns using pd.get_dummies() and keep track of the number of encoded columns per original column.
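The dummy-variable step can be illustrated on a toy frame (hypothetical columns; fun_enc5 below actually uses OneHotEncoder, which produces the same 0/1 columns with traceable names):

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({'channel': ['RETAIL', 'WHOLESALE', 'RETAIL'],
                    'volume': [10.0, 20.0, 30.0]})
# one 0/1 indicator column per category level; numeric column is kept
encoded = pd.get_dummies(toy, columns=['channel'], prefix='channel')
print(encoded.columns.tolist())
```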
def fun_enc5(df_name,df,y):
# separate target variable and predictors
df_target =df[y]
df_features = df.drop(y, axis=1)
    # Separate numeric and categorical predictors
df_num_features = df_features.select_dtypes(exclude=['object'])
df_cat_features = df_features.select_dtypes(include=['object'])
# extract just object type (string) columns from the dataframe.
# Get the column names of categorical variables as a list to prefix the encoded columns
df_cat_list = df_cat_features.columns.tolist()
    # Create a transformer object and fit it to cat_features
    # save dummy variables in a non-sparse matrix
    # This helps produce meaningful feature names
    # (note: scikit-learn >= 1.2 renames the sparse argument to sparse_output)
    enc = OneHotEncoder(dtype=np.int8, sparse=False)
arr = enc.fit_transform(df_cat_features)
# convert array to dataframe
# use the encoder instance method get_feature_names to prefix our newly encoded
# columns with the existing column names
df_cat_enc = pd.DataFrame(arr, columns=enc.get_feature_names_out(df_cat_list))
    # Added .reset_index(drop=True) to the following data frames
    # to avoid pd.concat() creating extra rows
    df_cat_enc = df_cat_enc.reset_index(drop=True)
    df_num_features = df_num_features.reset_index(drop=True)
# Join the sparse dataframe of dummy variables with the data frame of numeric features
df_features_enc = pd.concat([df_cat_enc,df_num_features],axis=1)
print(f'Features names after encoding {df_name}\n')
print(df_features_enc.columns,'\n')
print(f'{df_name} has {df_features_enc.shape[0]} rows, {df_features_enc.shape[1]} columns\n')
return df_features_enc, df_target;
def fun_split_dist(df_name,df_features_enc, df_target, test_pct):
    # split the predictors and the target data frame into test (test_pct) and
    # train (1-test_pct) dataframes
    # (specifying either train_size or test_size is sufficient)
    # You can change random_state to seed the random number generator differently
X_train, X_test, y_train, y_test = \
train_test_split(df_features_enc, df_target, test_size=test_pct, random_state=42)
# Generate y_counts in train and test sets
train_y_counts = y_train.value_counts()
test_y_counts = y_test.value_counts()
# print y distribution in percentage and frequency for train and test sets
print(f'In {df_name} train set ({(1-test_pct)*100}%):','\n')
for i in y_levels:
print(f'{round(100*train_y_counts[i]/X_train.shape[0],2)}% or {train_y_counts[i]} instances belong to class {i}','\n')
#
print(f'In {df_name} test set ({(test_pct)*100}%):','\n')
for i in y_levels:
print(f'{round(100*test_y_counts[i]/X_test.shape[0],2)}% or {test_y_counts[i]} instances belong to class {i}','\n')
return X_train, y_train, X_test, y_test;
def log_transform(df, column):
    df[column] = np.log(df[column])
    # log(0) is -inf and log of a negative value is NaN; map both to 0
    df[column] = df[column].replace([np.inf, -np.inf], np.nan)
    df[column] = df[column].fillna(0)
    return df
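A possible alternative worth noting (not what log_transform above does): np.log1p computes log(1 + x), so zero maps to zero without ever producing -inf:

```python
import numpy as np
import pandas as pd

# log_transform maps the -inf from log(0) to 0 after the fact;
# log1p avoids producing -inf in the first place for zero values
s = pd.Series([0.0, 1.0, np.e - 1])
print(np.log1p(s).tolist())   # approximately [0.0, 0.693, 1.0]
```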
#log transform columns
for i in df14.columns:
if df14[i].dtype == 'int64' or df14[i].dtype == 'float64':
df14 = log_transform(df14, i)
#drop duplicates
df14 = df14.drop_duplicates()
#drop the customer identifier column
df14 = df14.drop('CUSTOMER_NUMBER_BLINDED', axis=1)
df14.columns = df14.columns.astype(str)
CD_imb_features_enc, CD_imb_target= fun_enc5('df_imb',df14,'risk')
X_train, y_train, X_test, y_test = fun_split_dist('df_imb',CD_imb_features_enc, CD_imb_target, 0.2)
Features names after encoding df_imb
Index(['CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_At-Work',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Eating & Drinking',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Educational',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Entertainment/Recrea',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Grocery Shopping',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Other Shopping & Ser',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Third Party (Non-Con',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Travel/Transportatio',
'COLD_DRINK_CHANNEL_DESCRIPTION_AT WORK',
'COLD_DRINK_CHANNEL_DESCRIPTION_DESTINATION VENUE',
...
'revclvt_2', 'revclvt_3', 'revclvt_4', 'revclvt_5', 'volclvt_0',
'volclvt_1', 'volclvt_2', 'volclvt_3', 'volclvt_4', 'volclvt_5'],
dtype='object', length=1402)
df_imb has 34997 rows, 1402 columns
In df_imb train set (80.0%):
93.97% or 26309 instances belong to class low
6.03% or 1688 instances belong to class high
In df_imb test set (20.0%):
94.26% or 6598 instances belong to class low
5.74% or 402 instances belong to class high
In df_imb train set (80.0%):
93.97% or 26309 instances belong to class low
6.03% or 1688 instances belong to class high
In df_imb test set (20.0%):
94.26% or 6598 instances belong to class low
5.74% or 402 instances belong to class high
We will assign a cost or benefit to each prediction outcome for low-risk and high-risk customers. These values are based on the business rules and the cost of the marketing campaign, and can be changed to tune the model.

| | Predicted low risk | Predicted high risk |
|---|---|---|
| Actual low | Increase of CLTV | Decrease of CLTV (no signing with a good customer) |
| Actual high | Decrease of CLTV | No discount offered |

We start by assigning a cost for how much the median 3-year profit value decreases with each type of error, and how much it increases for each correct prediction. These values can be adjusted based on the cost of the marketing campaign and the business rules, but for this project we will use the following:

| | Predicted high risk | Predicted low risk |
|---|---|---|
| Actual high | 0 | -3733 |
| Actual low | -1300.22101 | 1300.22101 |
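The cost matrix is applied by multiplying it element-wise with a confusion matrix and averaging over all customers, as fun_avg_net_benefit below does. A worked example with a hypothetical 10-customer confusion matrix:

```python
import numpy as np

# cost/benefit factors (rows: actual high, low; cols: predicted high, low)
cb = np.array([[0, -3733], [-1300, 1300]])
# hypothetical confusion matrix for 10 customers
cf = np.array([[2, 1],     # 2 high correctly flagged, 1 high missed
               [1, 6]])    # 1 low wrongly flagged, 6 low correctly kept
total = (cf * cb).sum()    # 0*2 + (-3733)*1 + (-1300)*1 + 1300*6 = 2767
print(total / cf.sum())    # average net benefit per customer: 276.7
```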
cb_factors1 = np.array([[0,-3733],[ -1300, 1300]])
cb_factors1
array([[ 0, -3733],
[-1300, 1300]])
# Generate pred_test to show new metrics with test sets
clf=LogisticRegression(random_state=42)
clf_f = clf.fit(X_train,y_train)
pred_test = clf_f.predict(X_test)
cf = metrics.confusion_matrix(y_test,pred_test)
cf
#print the confusion matrix with class labels and totals
print(pd.DataFrame(cf, index=['true:high', 'true:low'], columns=['pred:high', 'pred:low']))
           pred:high  pred:low
true:high        193       209
true:low          84      6514
def fun_auprc(y_true, y_pred_prob):
    # predict_proba()[:, 1] corresponds to the second class label, 'low',
    # so pos_label must be 'low' here (the original pos_label='yes' does not
    # match this target's 'high'/'low' levels and breaks the curve)
    precision_arr, recall_arr, thresholds = precision_recall_curve(y_true, y_pred_prob, pos_label='low')
    auprc = round(metrics.auc(recall_arr, precision_arr), 2)
    return auprc
def fun_avg_net_benefit(y_true, y_pred):
    # create the confusion matrix (2x2 array)
    cf = confusion_matrix(y_true, y_pred)
    # multiply the array element-wise with the cost array, sum it up,
    # then divide the total by the number of instances
    netb_arr = cf * cb_factors1
    return round(netb_arr.sum() / y_true.shape[0], 2)
avg_netb = make_scorer(fun_avg_net_benefit)
auprc = make_scorer(fun_auprc, needs_proba=True)
# fun_avg_net_benefit creates a confusion matrix and multiply it
# with the cost array. sum up the result and return it
custom_scoring5 = {'acc':'accuracy','bacc':'balanced_accuracy',
'auroc': 'roc_auc','auprc': auprc,'avg_netb' : avg_netb}
index_names = ['mean','std']
def fun_clf_cv_val_test5(clf, clf_name, X_train,y_train,X_test,y_test,kfold):
# cross_validate an estimator on train data to generate validation performance
scores = cross_validate(clf,X_train,y_train,cv=kfold,scoring=custom_scoring5)
# Generate means and stds of all of the custom_scoring5 metrics
scores_df = pd.DataFrame(scores)
scores_df.columns= ['v_fit_time','v_test_time','v_acc',\
'v_bacc', 'v_auroc', 'v_auprc','v_avg_netb']
# drop the first two columns
scores_df = scores_df.drop(['v_fit_time','v_test_time'],axis=1)
val_mean = round(scores_df.mean(),2)
val_std = round(scores_df.std(),2)
# Save val_mean and val_std as rows in val_results_df
val_results_df = pd.DataFrame(data=(val_mean,val_std), index=index_names)
    val_results_df['clf_name'] = clf_name # helps identify the clf in val_results
# Generate test results based on a model trained by the whole train set
clf_f = clf.fit(X_train,y_train)
pred_test = clf_f.predict(X_test)
pred_proba_test_yes = clf_f.predict_proba(X_test)[:,1]
auroc = roc_auc_score(y_test,pred_proba_test_yes)
auprc = fun_auprc(y_test,pred_proba_test_yes)
avg_netb = fun_avg_net_benefit(y_test,pred_test)
test_results_df= round(pd.DataFrame({'t_acc': accuracy_score(y_test,pred_test),\
't_bacc': balanced_accuracy_score(y_test,pred_test),\
't_auroc':auroc,\
't_auprc':auprc,\
't_avg_netb': avg_netb},index=[0]),2)
    test_results_df['clf_name'] = clf_name # helps identify the clf in test_results
#Now bundle up validation and test results in a single row
cv_val_test_results_df = pd.DataFrame(
{'clf_name':[clf_name],'validate_results':[val_results_df],\
'test_results':[test_results_df]})
return cv_val_test_results_df
# Define the fun_cv_val_multi_clfs5() function
def fun_cv_val_multi_clfs5(clf_list,clf_name_list,X_train,y_train,X_test,y_test,kfold):
multi_clf_results_list = []
for i in range(0,len(clf_list)):
clf_results_df = fun_clf_cv_val_test5(clf_list[i],clf_name_list[i],X_train,y_train,X_test,y_test,kfold)
multi_clf_results_list.append(clf_results_df)
# force pd.concat() to create a good index
cv_val_multi_clf_results_df = pd.concat(multi_clf_results_list).reset_index(drop=True)
return cv_val_multi_clf_results_df
def fun_multi_clf_reports5(multi_clf_results_df):
val_list = []
test_list = []
# Get validate_results and test_results for each clf
for index, model_row in multi_clf_results_df.iterrows():
val_list.append(model_row['validate_results'])
test_list.append(model_row['test_results'])
# end of for loop
# Create multi_clf_val_report_df and multi_clf_test_report_df with results for a classifier in a single row
multi_clf_val_df = pd.concat(val_list)
multi_clf_val_df = multi_clf_val_df.reset_index(drop=False)
multi_clf_val_df = multi_clf_val_df.rename(columns={"index": "stat_type"})
#
multi_clf_val_mean_df = multi_clf_val_df[multi_clf_val_df['stat_type'].str.match('mean')].reset_index(drop=True)
multi_clf_val_std_df = multi_clf_val_df[multi_clf_val_df['stat_type'].str.match('std')].reset_index(drop=True)
#
multi_clf_test_df = pd.concat(test_list)
multi_clf_test_df = multi_clf_test_df.reset_index(drop=False)
#
return multi_clf_val_mean_df, multi_clf_val_std_df, multi_clf_test_df
import seaborn as sns  # needed for catplot(); the seaborn import at the top was commented out
def fun_cv_val_test_comparison5(multi_clf_val_mean_df,multi_clf_val_std_df,multi_clf_test_report_df,comp_name):
# show tabular comparisons and sns.catplot() of
# validation and test results by metric
clf_name_df = multi_clf_test_report_df['clf_name']
for i in range(0,len(metric_map.v_metric)):
val_metric_mean_df = multi_clf_val_mean_df[metric_map.v_metric.iloc[i]]
val_metric_std_df = multi_clf_val_std_df[metric_map.v_metric.iloc[i]]
test_metric_df = multi_clf_test_report_df[metric_map.test_metric.iloc[i]]
# create wide_df
wide_df = pd.concat([clf_name_df,test_metric_df, val_metric_mean_df, val_metric_std_df],axis=1)
# rename the columns
wide_df.columns = ['clf_name','test_set_result','val_set_mean',
'val_set_std']
print(metric_map.x_label.iloc[i],'from test (left) and from validation in mean and std (right)\n')
print(wide_df,'\n')
        # create long_df (collect rows in a list; DataFrame.append was removed in pandas 2.0)
        long_rows = []
        for r in range(0, len(clf_name_df)):
            long_rows.append({'clf_name': clf_name_df.iloc[r],
                              metric_map.x_label.iloc[i]: test_metric_df.iloc[r],
                              comp_name: 'test_set'})
            long_rows.append({'clf_name': clf_name_df.iloc[r],
                              metric_map.x_label.iloc[i]: val_metric_mean_df.iloc[r],
                              comp_name: 'val_set'})
        # end of r loop
        long_df = pd.DataFrame(long_rows, columns=['clf_name', metric_map.x_label.iloc[i], comp_name])
# use seaborn's catplot() to draw performance from test-set and val-set
# in groups of classifier
g = sns.catplot(
data=long_df, kind="bar",
y="clf_name", x=metric_map.x_label.iloc[i], hue=comp_name, orient='h'
)
# g.set(xlim=(0.0, 1.0))
g.fig.set_figwidth(10)
# end of i loop and function
# create a dataframe of val_metric, test_metric and label mappings that will be
# used to select result scores from the validation results and test results
# (built with the DataFrame constructor, since DataFrame.append was removed in pandas 2.0)
metric_map = pd.DataFrame({
    'v_metric': ['v_acc', 'v_bacc', 'v_auroc', 'v_auprc', 'v_avg_netb'],
    'test_metric': ['t_acc', 't_bacc', 't_auroc', 't_auprc', 't_avg_netb'],
    'x_label': ['Weighted_accuracy', 'Balanced_accuracy', 'Area_Under_ROC',
                'Area_Under_PRC', 'Average_Net_Benefits']})
metric_map
| v_metric | test_metric | x_label | |
|---|---|---|---|
| 0 | v_acc | t_acc | Weighted_accuracy |
| 1 | v_bacc | t_bacc | Balanced_accuracy |
| 2 | v_auroc | t_auroc | Area_Under_ROC |
| 3 | v_auprc | t_auprc | Area_Under_PRC |
| 4 | v_avg_netb | t_avg_netb | Average_Net_Benefits |
#def fun_cv_val_multi_clf_all_tasks5():
def fun_cv_val_multi_clf_all_tasks5(clf_list, clf_name_list,X_train,y_train,X_test,y_test,kfold,comp_name):
multi_clf_results_df = fun_cv_val_multi_clfs5(clf_list, clf_name_list,X_train,y_train,X_test,y_test,kfold)
multi_clf_val_mean_df, multi_clf_val_std_df, multi_clf_test_report_df = fun_multi_clf_reports5(multi_clf_results_df)
fun_cv_val_test_comparison5(multi_clf_val_mean_df,multi_clf_val_std_df, multi_clf_test_report_df,comp_name)
#lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',]
kfold=3
fun_cv_val_multi_clf_all_tasks5(lr_list,lr_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.96 0.96 0.0
1 newton_l2 0.96 0.96 0.0
2 lib_l2 0.96 0.96 0.0
3 lib_l1 0.96 0.96 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.73 0.74 0.02
1 newton_l2 0.73 0.74 0.00
2 lib_l2 0.72 0.73 0.01
3 lib_l1 0.73 0.74 0.01
Area_Under_ROC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.97 0.97 0.0
1 newton_l2 0.97 0.97 0.0
2 lib_l2 0.97 0.97 0.0
3 lib_l1 0.97 0.97 0.0
Area_Under_PRC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.5 0.5 0.0
1 newton_l2 0.5 0.5 0.0
2 lib_l2 0.5 0.5 0.0
3 lib_l1 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 1085.17 1074.61 8.64
1 newton_l2 1083.36 1075.56 3.25
2 lib_l2 1080.91 1072.08 2.74
3 lib_l1 1084.54 1077.35 6.18
#rf_name_list = ['rf_default','rf_entropy_default','rf_max7','rf_entropy_max7']
kfold=3
fun_cv_val_multi_clf_all_tasks5(rf_list,rf_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.94 0.94 0.0
1 rf_entropy_default 0.94 0.94 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.5 0.5 0.0
1 rf_entropy_default 0.5 0.5 0.0
Area_Under_ROC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.95 0.95 0.01
1 rf_entropy_default 0.96 0.96 0.00
Area_Under_PRC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.5 0.5 0.0
1 rf_entropy_default 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 1010.96 996.55 0.3
1 rf_entropy_default 1010.96 996.55 0.3
#SVC_name_list = ['SVC_rbf','SVC_linear','SVC_poly']
kfold=3
fun_cv_val_multi_clf_all_tasks5(SVC_list,SVC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 SVC_rbf 0.96 0.95 0.0
1 SVC_linear 0.96 0.96 0.0
2 SVC_poly 0.95 0.95 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 SVC_rbf 0.68 0.67 0.01
1 SVC_linear 0.75 0.76 0.00
2 SVC_poly 0.64 0.63 0.00
Area_Under_ROC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 SVC_rbf 0.98 0.98 0.0
1 SVC_linear 0.97 0.97 0.0
2 SVC_poly 0.97 0.97 0.0
Area_Under_PRC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 SVC_rbf 0.5 0.5 0.0
1 SVC_linear 0.5 0.5 0.0
2 SVC_poly 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 SVC_rbf 1070.09 1055.56 1.61
1 SVC_linear 1080.17 1077.37 0.34
2 SVC_poly 1054.51 1042.64 3.52
#DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']
kfold=3
fun_cv_val_multi_clf_all_tasks5(DTC_list,DTC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.95          0.95          0.0
1      DTC4             0.96          0.95          0.0
2  DTC_ent4             0.95          0.95          0.0
3      DTC5             0.95          0.96          0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.70          0.69         0.03
1      DTC4             0.79          0.78         0.03
2  DTC_ent4             0.70          0.74         0.06
3      DTC5             0.77          0.74         0.03
Area_Under_ROC from test (left) and from validation in mean and std (right)
   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.96          0.96          0.0
1      DTC4             0.97          0.96          0.0
2  DTC_ent4             0.97          0.97          0.0
3      DTC5             0.97          0.97          0.0
Area_Under_PRC from test (left) and from validation in mean and std (right)
   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3              0.5           0.5          0.0
1      DTC4              0.5           0.5          0.0
2  DTC_ent4              0.5           0.5          0.0
3      DTC5              0.5           0.5          0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3          1058.68       1050.71         7.43
1      DTC4          1084.21       1071.19         6.97
2  DTC_ent4          1061.39       1059.51         8.20
3      DTC5          1078.32       1072.51         8.15
We use grid search to find the best class-weight parameter: we give a higher weight to the minority class to see if the model improves, and we evaluate it on the best model from each of the previous model families.
weight_grid = {'class_weight':[{'high': w} for w in np.linspace(10,100,5)]}
DTC4 = DecisionTreeClassifier(max_depth=4,random_state=42)
SVC_linear = SVC(kernel='linear',probability=True,random_state=42)
rf_entropy_max7 = RandomForestClassifier(criterion='entropy',max_depth=7,random_state=42)
lr_newton = LogisticRegression(solver='newton-cg',random_state=42)
w1_clf_list=[DTC4,SVC_linear,rf_entropy_max7,lr_newton]
w_clf_name_list=['DTC4','SVC_linear','rf_entropy_max7','lr_newton']
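One subtlety of class_weight={'high': w}: classes omitted from the dict keep weight 1, so only the minority class is up-weighted. A toy demonstration on synthetic data (all names and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced binary problem with string labels like the project's target
rng = np.random.RandomState(42)
X = rng.randn(200, 2)
y = np.where(X[:, 0] + rng.randn(200) > 1.5, 'high', 'low')  # ~15% 'high'
plain = LogisticRegression(random_state=42).fit(X, y)
weighted = LogisticRegression(class_weight={'high': 10},
                              random_state=42).fit(X, y)
# up-weighting 'high' should make the model predict 'high' at least as often
print((plain.predict(X) == 'high').sum(),
      (weighted.predict(X) == 'high').sum())
```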
def fun_clf_gscv_avg_netb(clf, clf_name, X_train, y_train, X_test, y_test, kfold, weight_grid):
    # Use GridSearchCV() with the average-net-benefit scorer to find the
    # best class weight, estimator, and score
    clf_gscv = GridSearchCV(clf, weight_grid, refit=True, cv=kfold,
                            scoring=make_scorer(fun_avg_net_benefit)).fit(X_train, y_train)
    # Generate test results based on clf_gscv.best_estimator_
    pred_y = clf_gscv.best_estimator_.predict(X_test)
    test_avg_netb = fun_avg_net_benefit(y_test, pred_y)
    # Bundle the validation and test results into a single row
    gscv_avg_netb_df = pd.DataFrame(
        {'clf_name': [clf_name],
         'best_gscv_weight': [clf_gscv.best_params_],
         'best_gscv_avg_netb': clf_gscv.best_score_,
         'test_avg_netb': test_avg_netb})
    return gscv_avg_netb_df
Run with k-fold = 3 to see the results:
fun_clf_gscv_avg_netb(DTC4 ,'DTC4',X_train,y_train,X_test,y_test,3,weight_grid)
| | clf_name | best_gscv_weight | best_gscv_avg_netb | test_avg_netb |
|---|---|---|---|---|
| 0 | DTC4 | {'class_weight': {'high': 10.0}} | 998.536667 | 1014.5 |
fun_clf_gscv_avg_netb(SVC_linear ,'SVC_linear',X_train,y_train,X_test,y_test,3,weight_grid)
| | clf_name | best_gscv_weight | best_gscv_avg_netb | test_avg_netb |
|---|---|---|---|---|
| 0 | SVC_linear | {'class_weight': {'high': 10.0}} | 1057.93 | 1067.04 |
fun_clf_gscv_avg_netb(rf_entropy_max7 ,'rf_entropy_max7',X_train,y_train,X_test,y_test,3,weight_grid)
| | clf_name | best_gscv_weight | best_gscv_avg_netb | test_avg_netb |
|---|---|---|---|---|
| 0 | rf_entropy_max7 | {'class_weight': {'high': 10.0}} | 929.16 | 934.66 |
fun_clf_gscv_avg_netb(lr_newton ,'lr_newton',X_train,y_train,X_test,y_test,3,weight_grid)
| | clf_name | best_gscv_weight | best_gscv_avg_netb | test_avg_netb |
|---|---|---|---|---|
| 0 | lr_newton | {'class_weight': {'high': 10.0}} | 1064.243333 | 1069.84 |
The features used in the previous models have some correlation between them. We will rerun the models with fewer features to see if performance improves, selecting mostly features that would be available at the moment of onboarding a new customer.
df15 = df13.copy()  # copy so the inplace drop below does not also modify df13
df15.drop(['ACTIVE','TOTAL_PROFIT', 'TOTAL_VOLUME', 'TOTAL_INVOICES','TMRR_MAX', 'longetivity','revclvt', 'volclvt', 'profitclvt', 'profitclvt3',
'EXPECTED_TRANS_YEAR_0', 'EXPECTED_TRANS_YEAR_1',
'EXPECTED_TRANS_YEAR_2', 'EXPECTED_TRANS_YEAR_3',
'EXPECTED_TRANS_YEAR_4', 'EXPECTED_TRANS_YEAR_5', 'profitclvt_0',
'profitclvt_1', 'profitclvt_2', 'profitclvt_3', 'profitclvt_4',
'profitclvt_5', 'revclvt_0', 'revclvt_1', 'revclvt_2', 'revclvt_3',
'revclvt_4', 'revclvt_5', 'volclvt_0', 'volclvt_1', 'volclvt_2',
'volclvt_3', 'volclvt_4', 'volclvt_5',
],axis=1,inplace=True)
df15.columns
Index(['CUSTOMER_NUMBER_BLINDED', 'UNIQUE_PRODUCTS', 'EXPECTED_TRANS_YEAR',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION',
'COLD_DRINK_CHANNEL_DESCRIPTION', 'location_Cluster', 'risk'],
dtype='object')
#log transform numeric columns
for i in df15.columns:
    if df15[i].dtype == 'int64' or df15[i].dtype == 'float64':
        df15 = log_transform(df15, i)
#drop duplicates
df15 = df15.drop_duplicates()
#drop the customer identifier, which carries no predictive signal
df15 = df15.drop('CUSTOMER_NUMBER_BLINDED', axis=1)
df15.columns = df15.columns.astype(str)
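The `log_transform` helper used above is defined earlier in the notebook. A minimal sketch of one plausible implementation, assuming a `log1p` transform (which handles zero values safely), is:

```python
import numpy as np
import pandas as pd

def log_transform(df, col):
    """Return a copy of df with column `col` replaced by log(1 + x)."""
    df = df.copy()
    df[col] = np.log1p(df[col])
    return df
```

`log1p` is preferred over a plain `log` here because many count-like columns (invoices, transactions) contain zeros.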
CD_imb_features_enc, CD_imb_target= fun_enc5('df_imb',df15,'risk')
X_train, y_train, X_test, y_test = fun_split_dist('df_imb',CD_imb_features_enc, CD_imb_target, 0.2)
Features names after encoding df_imb
Index(['CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_At-Work',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Eating & Drinking',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Educational',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Entertainment/Recrea',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Grocery Shopping',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Other Shopping & Ser',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Third Party (Non-Con',
'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Travel/Transportatio',
'COLD_DRINK_CHANNEL_DESCRIPTION_AT WORK',
'COLD_DRINK_CHANNEL_DESCRIPTION_DESTINATION VENUE',
...
'location_Cluster_992', 'location_Cluster_993', 'location_Cluster_994',
'location_Cluster_995', 'location_Cluster_996', 'location_Cluster_997',
'location_Cluster_998', 'location_Cluster_999', 'UNIQUE_PRODUCTS',
'EXPECTED_TRANS_YEAR'],
dtype='object', length=1372)
df_imb has 34997 rows, 1372 columns
In df_imb train set (80.0%):
93.97% or 26309 instances belong to class low
6.03% or 1688 instances belong to class high
In df_imb test set (20.0%):
94.26% or 6598 instances belong to class low
5.74% or 402 instances belong to class high
#### Logistic Regression evaluation
#lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',]
kfold=3
fun_cv_val_multi_clf_all_tasks5(lr_list,lr_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.94 0.94 0.0
1 newton_l2 0.94 0.94 0.0
2 lib_l2 0.94 0.94 0.0
3 lib_l1 0.94 0.94 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.59 0.58 0.01
1 newton_l2 0.59 0.58 0.01
2 lib_l2 0.59 0.58 0.01
3 lib_l1 0.59 0.58 0.01
Area_Under_ROC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.9 0.89 0.00
1 newton_l2 0.9 0.89 0.00
2 lib_l2 0.9 0.89 0.00
3 lib_l1 0.9 0.89 0.01
Area_Under_PRC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 0.5 0.5 0.0
1 newton_l2 0.5 0.5 0.0
2 lib_l2 0.5 0.5 0.0
3 lib_l1 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 lbfgs_l2 1025.61 1010.11 5.14
1 newton_l2 1024.70 1009.79 5.28
2 lib_l2 1024.01 1010.36 5.00
3 lib_l1 1024.75 1011.66 5.48
#### Random Forest evaluation
#rf_name_list = ['rf_default','rf_entropy_default','rf_max7','rf_entropy_max7']
kfold=3
fun_cv_val_multi_clf_all_tasks5(rf_list,rf_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.94 0.94 0.0
1 rf_entropy_default 0.94 0.94 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.5 0.5 0.0
1 rf_entropy_default 0.5 0.5 0.0
Area_Under_ROC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.85 0.86 0.02
1 rf_entropy_default 0.89 0.88 0.01
Area_Under_PRC from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 0.5 0.5 0.0
1 rf_entropy_default 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
clf_name test_set_result val_set_mean val_set_std
0 rf_default 1010.96 996.55 0.3
1 rf_entropy_default 1010.96 996.55 0.3
#### DecisionTreeClassifier evaluation
#DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']
kfold=3
fun_cv_val_multi_clf_all_tasks5(DTC_list,DTC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)
  clf_name test_set_result val_set_mean val_set_std
0 DTC3 0.95 0.94 0.0
1 DTC4 0.94 0.94 0.0
2 DTC_ent4 0.94 0.94 0.0
3 DTC5 0.94 0.94 0.0
Balanced_accuracy from test (left) and from validation in mean and std (right)
  clf_name test_set_result val_set_mean val_set_std
0 DTC3 0.55 0.57 0.04
1 DTC4 0.55 0.58 0.04
2 DTC_ent4 0.55 0.55 0.01
3 DTC5 0.56 0.60 0.04
Area_Under_ROC from test (left) and from validation in mean and std (right)
  clf_name test_set_result val_set_mean val_set_std
0 DTC3 0.88 0.85 0.01
1 DTC4 0.88 0.86 0.01
2 DTC_ent4 0.90 0.89 0.01
3 DTC5 0.89 0.86 0.01
Area_Under_PRC from test (left) and from validation in mean and std (right)
  clf_name test_set_result val_set_mean val_set_std
0 DTC3 0.5 0.5 0.0
1 DTC4 0.5 0.5 0.0
2 DTC_ent4 0.5 0.5 0.0
3 DTC5 0.5 0.5 0.0
Average_Net_Benefits from test (left) and from validation in mean and std (right)
  clf_name test_set_result val_set_mean val_set_std
0 DTC3 1024.77 1008.92 0.99
1 DTC4 1022.70 1009.81 0.64
2 DTC_ent4 1023.01 1008.71 0.58
3 DTC5 1024.46 1009.20 2.69
After conducting exploratory data analysis and creating additional features from the data provided by Swire, we were able to develop a model that predicts the risk of doing business with a customer. This risk variable was calculated using the customer's CLTV over a period of 3 years and the probability of being inactive over the next 3 years.
To calculate the customer's lifetime value, we used their profit over the last 3 years and the expected number of transactions per year over the next 3 years. We then created a feature that predicts the customer's lifetime value in 3 years.
We defined a customer as risky if they were inactive (no orders in the last 3 months) and had been a customer for less than 3 years, and if their CLTV over the next 3 years was less than the elbow point of the diminishing returns curve.
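This labeling rule can be sketched as a small function. This is an illustrative sketch only: the column names (`ACTIVE`, `longevity_years`, `cltv_3yr`) and the elbow-point threshold are hypothetical stand-ins for the notebook's actual feature names.

```python
import pandas as pd

def label_risk(df, cltv_elbow):
    """Label a customer 'high' risk when inactive (no orders in the last
    3 months), tenure under 3 years, and projected 3-year CLTV below the
    elbow point of the diminishing-returns curve; otherwise 'low'."""
    risky = (
        (df['ACTIVE'] == 0)
        & (df['longevity_years'] < 3)
        & (df['cltv_3yr'] < cltv_elbow)
    )
    return risky.map({True: 'high', False: 'low'})
```

All three conditions must hold for a customer to be labeled high risk; an inactive but long-tenured customer, for example, stays low risk.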
We also used Kmeans to cluster products and predict the customer's CLTV for each product type. This information was used as a feature to predict the risk and to monitor transaction behavior for any potential issues.
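The product-clustering step follows the standard scikit-learn k-means pattern. The sketch below uses random stand-in data; the real feature matrix (per-product sales profiles) and the number of clusters are assumptions, not the project's actual values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in product feature matrix: 50 products x 2 features
# (e.g. average order volume and average profit per product).
rng = np.random.default_rng(42)
product_features = rng.normal(size=(50, 2))

# Fit k-means and attach one cluster id per product; the ids can then be
# joined back to the transaction data as a categorical feature.
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(product_features)
product_cluster = km.labels_
```

The resulting cluster ids let the model estimate a separate CLTV trajectory per product type rather than one blended figure.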
We used four models (Logistic Regression, Random Forest, Support Vector Machine, and DecisionTreeClassifier) to predict customer risk, evaluating each model with different parameters. We developed a cost matrix to evaluate the model based on the cost of marketing campaigns and business rules. We also used grid search to find the best parameters for class weight to give a higher weight to the minority class and determine if the model would improve.
We performed feature selection, focusing mainly on the features we would obtain when onboarding a new customer, and removed highly correlated features to improve the model's performance.
The models were evaluated on several metrics, and the data was split into train, validation, and test sets, with a cross-validation of 3 folds. After reviewing the different measures, we will choose the model with the highest average net benefit. We analyzed the confusion matrix and cost matrix to identify the types of errors the model was making and the cost of each error.
| | Predicted high risk | Predicted low risk |
|---|---|---|
| Actual high | 0 | -3733 |
| Actual low | -1300 | 1300 |
The values reflect how each outcome affects the customer's lifetime profit over 3 years. The cost of signing a bad customer (predicting low risk for an actually high-risk customer) is 3733, while the cost of not signing a good customer (predicting high risk for an actually low-risk customer) is 1300. These figures are based on the median CLTV of a high-risk customer, which is 1300 below the 3-year median CLTV; a correct low-risk prediction increases CLTV by 1300.
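The `fun_avg_net_benefit` scorer used throughout the model comparisons is defined earlier in the notebook; a minimal sketch consistent with the cost matrix above, assuming labels `'high'` and `'low'`, is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def fun_avg_net_benefit(y_true, y_pred):
    """Average net benefit per customer: element-wise product of the
    confusion matrix and the cost/benefit matrix, divided by n.
    Rows/columns follow the label order ['high', 'low']."""
    benefit = np.array([[0, -3733],
                        [-1300, 1300]])
    cm = confusion_matrix(y_true, y_pred, labels=['high', 'low'])
    return (cm * benefit).sum() / cm.sum()
```

Wrapping this in `make_scorer(fun_avg_net_benefit)` lets `GridSearchCV` optimize directly for expected dollars per customer instead of accuracy.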
The best model was a linear SVC when using all the features from transaction history and product CLTVs. The model had an average net benefit of 1087 on test and 1077.37 on validation. These models take close to 90 minutes to run.
When the minority class was weighted and a grid search was run, the best weight was 10 for all models, and the best model was the Newton logistic regression with a score of 1064.
Finally, we ran the models with fewer features to see their performance. The best model was the logistic regression, with no large difference from the other models. This evaluation shows that reducing the feature set to customer demographics and characteristics does not greatly affect model performance and will be easier to implement.
We can keep adding features to the model and run scenarios that drop features to compare performance. We can also run the models with different cost matrices to see how they perform under different marketing-campaign costs and business rules. We will create a dashboard containing these new features to help understand the customer base and behavior, guiding how to increase CLTV and reduce the risk of doing business with a customer.
The functions and modeling implemented in this project were done for educational purposes in Advanced Data Mining, taught by Professor Olivia Sheng. The data used in this project was provided by Swire Beverages. The results of this project are not to be used for any purpose other than education.